United States Patent [19]
Crapo

US005629846A
[11] Patent Number: 5,629,846
[45] Date of Patent: May 13, 1997



[54] METHOD AND SYSTEM FOR DOCUMENT 
TRANSLATION AND EXTRACTION 

[75] Inventor: Andrew W. Crapo, Scotia, N.Y. 

[73] Assignee: General Electric Company, 
Schenectady, N.Y. 



[21] Appl. No.: 313,961

[22] Filed: Sep. 28, 1994

[51] Int. Cl.6: G06F 17/22
[52] U.S. Cl.: 395/785; 395/774
[58] Field of Search: 364/419.1; 395/600, 395/145, 146, 148, 500

[56] References Cited

U.S. PATENT DOCUMENTS

4,559,614 12/1985 Peek et al. 364/900
4,730,270 3/1988 Okajima et al. 364/900
4,881,197 11/1989 Fischer 364/900
4,896,289 1/1990 Svinicki et al. 364/927.92
5,208,905 5/1993 Takakura et al. 395/148
5,438,657 8/1995 Nakatsmi 395/148

Primary Examiner: Gail O. Hayes
Assistant Examiner: Frantzy Poinvil
Attorney, Agent, or Firm: David C. Goldman; Marvin Snyder

[57] ABSTRACT


[Front-page drawing, flow chart of FIG. 4: start; 32 receive the source document; 34 select and extract portions from the source document; 36 transform portions into format of the target document; 38 identify tags in the source and target documents and deduce the translation rule set; 40 store the translation rule set; 42 apply the translation rule set to the source document]

A method and system for translating an electronic document
from one format to an electronic document in a second
format. Selected portions from a source document are
extracted and transformed into the format of a target
document. A translation rule set is then deduced from the
extracted portions and the transformed portions. The
translation rule set is then applied to the source document,
producing a first draft. If the translation rule set is unable to
translate a portion from the source document, then the user
is notified of the untranslatable portion. The user then
provides examples of how the untranslatable portion should
be translated into the format of the target document. The
translation rule set is then modified in accordance with the
examples. Next, the modified translation rule set is applied
to the source document, producing a second draft. The above
steps are repeated until the source document has been
completely translated into the format of the target document
or until the user is satisfied with the translation.
METHOD AND SYSTEM FOR DOCUMENT
TRANSLATION AND EXTRACTION

BACKGROUND OF THE INVENTION

The present invention relates generally to electronic textual
documents, and more particularly to translating an
electronic textual document having a first format into an
electronic target document having a second format.

Electronic textual documents exist in many different
formats ranging from plain ASCII text to [illegible] greater
interest and importance attached to storing documents in
formats which are public or which may be translated into
public formats. One particular class of public formats is
known as Standard Generalized Markup Language (SGML),
which is a standard-based tagging methodology that provides
a platform and application independent document
[illegible] document codes known as tags to build the
document into its final formatted form. These standard-based
tagging methodologies are gaining in use, especially in the
publishing industry. However, there exists a vast amount of
electronic textual and paper documents available for scanning
and use of optical character recognition which are in
nonstandard formats that cannot be readily translated into
SGML compliant formats.

Since the value of a document is dependent upon its
accessibility, there is further value added when the document
can be displayed in a different environment using different
viewers which sometimes require different formats. Thus,
there is a need to be able to translate documents from one
format or tagging scheme to another. Currently, there are
several types of document viewing/editing software available
that provide internal or external translators that can go
from their own format to an industry standard and formats
of others. Essentially, these translators are written in
low-level languages such as C or C++, or by using LEX and
YACC to construct parsers. LEX is a tool for building lexical
analyzers which identify the next token in the character
stream being processed. YACC is a tool for creating
rule-based parsers which receive the stream of tokens from the
lexical analyzer and identify the pattern and ensure legal
syntax. Once such a parser has been written to understand a
particular format, code may then be written to output the
information in the target format. Coding these translators is
labor intensive and requires a great deal of time. Therefore,
there is a need for an easy to use approach that translates
documents without requiring a lot of time and specialized
skill to write the translation code.

In addition to the translation problem described above,
there often exists a need to restructure a document. If the
document is in an SGML-compatible format, restructuring
may be done by simply editing the DTD (document type
definition). However, if the document is not in a standard
format, it may be very difficult to restructure the document.
Thus, if there were an easy to use translator, this problem
could be overcome by approaching the restructuring as a
translation problem.

SUMMARY OF THE INVENTION

Therefore, it is a primary objective of the present invention
to provide a method and system for translating documents
from one format into another format which does not
require a lot of time and skill to use.

Another object of the present invention is to provide a
method and system that can quickly restructure an original
source document into a target document by using selected
examples.

In the present invention a user selects a few portions of
a source document which is to be translated. The selected
portions contain various constructs and formats. The
selected portions are then transformed manually by an editor
into the format of a target document. In the present
invention, [illegible] or control information is modified.
Translation rules are then deduced from the transformed
selected portions and their corresponding originals. The
translation rules are used for mapping the entire source
document to the target document. If the translator is
confronted with constructs or formats that are not covered by
the translation rules, then the user can supply additional
examples to extend the amount of translation. The present
invention is simple and does not require the time or skill
necessary to write codes in the above-mentioned translators.

Thus, in accordance with the present invention, there is
provided a method for translating an electronic source
document having a first format into an electronic target
document having a second format. The method comprises
selecting portions of constructs and formats from a source
document. Then the selected portions are extracted from the
source document. The format of the extracted portions is
transformed into the second format of the electronic target
document. From the original and transformed portions, a set
of translation rules is deduced. The translation rule set is
applied to the electronic source document and used to
translate the source document into the target document. As
the translation rule set is applied to the electronic source
document, a first draft is produced.

Also, in accordance with the present invention, there is
provided a system for translating an electronic source
document having a first format into an electronic target document
having a second format. The system comprises a selecting
means for selecting portions from the source document
having various constructs and formats. An extraction means
extracts selected portions from the source document. A
transforming means transforms the format of the extracted
portions into the second format of the electronic target
document. The transformation can be greatly facilitated by
importing the text into a WYSIWYG (What you see is what
you get) editor for the [illegible] format, but the present
invention does not assume any particular process for preparing
the selected portions of the document. A deducing means
deduces a translation rule set from the original and
transformed portions. A first applying means applies the
translation rule set to the electronic source document. A first
producing means produces a first draft of the electronic
target document as the translation rule set is applied to the
electronic source document.

While the present invention will hereinafter be described
in connection with a preferred embodiment and method of
use, it will be understood that it is not intended to limit the
invention to this embodiment. Instead, it is intended to cover
all alternatives, modifications and equivalents as may be
included within the spirit and scope of the present invention
as defined by the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a system according to the
present invention;

FIG. 2 is a schematic illustrating the translation of a
source document into the format of the target document;
FIG. 3 is an example of a translation rule set; and

FIG. 4 is a flow chart describing the operation of the
present invention.

DETAILED DESCRIPTION OF THE PRESENT
INVENTION

FIG. 1 shows a block diagram of a system 10 according
to the present invention. The system includes a
CPU/controller 12 which may, for example, be a general purpose
computer such as a personal computer, a workstation, or a
micro- or minicomputer. The computer receives an electronic
source document 14 which may be a computer file or
other type of electronically stored document with all of the
original tagging/formatting information still intact. The
input document may also be the result of scanning a source
document from hard copy (i.e., paper) and using optical
character recognition and other techniques to capture, with
the text, as much formatting information as possible. The
electronic source document 14 may be generated from a
commercially available word processing package such as
Microsoft Word™, Framemaker™, WordPerfect™, or
[illegible].

An interactive knowledge extractor 16, such as a mouse
and keyboard, selects various portions from the source
document and extracts the selected portions therefrom. The
extracted portions are then transformed into the format of a
target document 24 by a transformer 20 such as an editor.
The various constructs and formats (tags) from the selected
portions are then identified by a construct tagger 18. In order
to deduce a translation rule set, the CPU/Controller needs
the tags from the extracted portions and the new tags from
the target portions. The text then serves as a basis or
reference to allow deduction of what are tags in both the old
and new representations. Once both types of tags are
identified by the construct tagger, translation rules may be
postulated. The CPU/Controller 12 then deduces the
translation rule set that is used to translate the source document
into the target document. The translation rule set is stored in
a translation rule base 22. After the translation rule set has
been deduced, it is then applied by the CPU/Controller to the
electronic source document 14. As the translation rule set is
applied, the text and format in the source document are
translated into the format of the target document (i.e.,
another, commercially available word processing package)
resulting in a first draft. An error/warning log generator 25
identifies portions of text in the source document that were
unable to be translated by the translation rule set. The
translation rule set is then modified into a new translation
rule set by selecting additional examples or portions from
the source document and applying the technique described
above. Then the new translation rule set is applied to the
electronic source document. These steps continue until the
source document has been completely translated into the
target document or until the user is satisfied with the
translation.

FIG. 2 is a schematic illustrating the translation of the
source document 14 into the format and constructs of the
target document 24. Within the source document 14 are
portions of text 26 labeled A, B, C, D, etc., having various
constructs and formats. Note that the portions could be pages
of text or classes of documents and are not limited to
paragraphs as shown in FIG. 2. The interactive knowledge
extractor 16 selects various portions (e.g., A, B, C, D) from
the source document and extracts the selected portions
therefrom. The text and the original tagging in the selected
portions are extracted. The extracted portions are then
transformed into the format of the target document 24 by the
transformer 20. The construct tagger 18 identifies the tags
extracted from the source document and the new tags
formed from the transformation. In FIG. 2, portion A is
extracted and transformed into the format of the target
document and shown as portion A'. Similar extractions and
transformations are made on the portions B, C, and D,
resulting in corresponding portions B', C', and D',
respectively.

After transforming several portions 26, the translation
rule set is then deduced from these portions and the
transformed portions 28. The translation rule set is deduced by
using the principles of case-based reasoning. In particular, the
translation rule set is deduced by examining the portions of
text [illegible]. By examining enough cases
(i.e., original selected portions and corresponding
transformed portions), rules can be correlated on how to translate
the document [illegible] of the target document.
An example of a translation rule set 30 for portions
[illegible] translates into [illegible]<nl>. Thus, rules can be
deduced to indicate that the format \SubHA\ in the source
document will translate into the format [illegible] in the target
[illegible] and the format [illegible] will translate into the
format <nl>. Other rules can be derived from other portions and
placed in the translation rule set. After the translation rule set
has been deduced, it is applied to the electronic source
document 14. The example in FIG. 3 is an illustration of how
the translation rules are deduced. Although the example is
relatively simple in scope, it is within the realm of the present
invention to cover much more complicated cases. As the
translation rule set 30 is applied, the text and format in the
source document are translated into the format of the target
document, resulting in a draft.

An added feature of the present invention is the
error/warning log generator 25 which identifies portions of the
source document that were unable to be translated by the
translation rule set. This feature depends upon the translator
having a model of the source tagging scheme allowing it to
identify characters which are likely to be tags in the source
document, but for which it has no translation rules.
However, this model need not be difficult. In the example
provided in FIG. 3, the model might be that characters
between two back slashes are tags. In the present invention,
the error/warning log generator 25 has the capacity to
provide assumptions of problems in the translation rule set
that are preventing the translation of a certain portion of the
source document. Alternatively, the error/warning log
generator can notify the user of missing portions or rules from
the translation rule set 30 or of conflicting examples which
require additional information to allow discrimination and
avoid conflicting rules. Upon receiving the error/warning
log, the user interactively enters portions of text and format
from the source document using the editor to transform the
portions of the source document into the corresponding
format of the target document. After enough portions of the
source document have been selected, extracted, and
transformed, the CPU/Controller 12 modifies the translation
rule set into a new translation rule set. After the new
translation rule set has been formulated, it is then applied to
the electronic source document. As the new translation rule
set is applied, the text and format in the source document are
again translated into the format of the target document,
resulting in a second draft.

If the second draft is not a complete translation of the
source document, then the error/warning log generator
notifies the user of other portions of text that were identified as
being untranslatable. The user then selects more examples
from the source document and the editor transforms them
into the format of the target document. After enough
portions of the source document have been selected, extracted,
and transformed, the translation rule set is modified into
another translation rule set. After the newest translation rule
set has been formulated, it is then applied to the electronic
source document. As the new translation rule set is applied,
the text and format in the source document are again
translated into the format of the target document, resulting in
a third draft. The above steps continue until the source
document has been completely translated into the target
document or until the user is satisfied with the translation.
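
The deduction step described above, in which the unchanged text serves as a reference for finding the tags, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: source tags are assumed to be runs of characters between two backslashes (the tag model the description suggests for FIG. 3), and the function name `deduce_rules`, the regular expression, and the tag names in the usage lines are all hypothetical.

```python
import re

# Assumed source tag model: non-space characters between two backslashes.
SOURCE_TAG = re.compile(r"(\\[^\\\s]+\\)")


def deduce_rules(examples):
    """Deduce {source_tag: target_tag} from (source_portion, target_portion) pairs.

    The text shared by both portions is the reference: whatever sits in the
    target just before a piece of shared text is taken to be the translation
    of the source tag found at the same place.
    """
    rules, conflicts = {}, []

    def record(source_tag, target_tag):
        if not source_tag:
            return
        known = rules.setdefault(source_tag, target_tag)
        if known != target_tag:
            # Same source tag, different translations: a conflicting example.
            conflicts.append((source_tag, known, target_tag))

    for source, target in examples:
        cursor, pending = 0, ""
        # re.split with a capturing group alternates text (even) and tags (odd).
        for index, piece in enumerate(SOURCE_TAG.split(source)):
            if index % 2 == 1:
                pending += piece  # adjacent tags are treated as one unit
            elif piece:
                found = target.find(piece, cursor)
                if found < 0:
                    raise ValueError(f"text {piece!r} was altered in the target")
                record(pending, target[cursor:found])
                cursor, pending = found + len(piece), ""
        record(pending, target[cursor:])  # tag at the very end of the portion
    return rules, conflicts


# Hypothetical example pair: one selected portion and its transformed version.
examples = [("\\title\\Overview\\par\\", "<h1>Overview<p>")]
rules, conflicts = deduce_rules(examples)
print(rules, conflicts)
```

A source tag that maps to two different target tags is collected in `conflicts` rather than silently overwritten, which corresponds to the conflicting examples the error/warning log is said to report. The plain `str.find` alignment is deliberately naive; very short text pieces could match inside a target tag, so a real system would need a more careful alignment.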
FIG. 4 is a flow chart describing the operation of the 
present invention. At 32, the computer receives the elec- 
tronic source document 14. The interactive knowledge 
extractor selects various portions from the source document 
and extracts the selected portions at 34. The extracted 
portions are then transformed into the format of the target 
document at 36. The various constructs and formats from the 
source document and the transformed portions from the 
target document are identified at 38. In addition, the trans- 
lation rule set is then deduced at 38. The translation rule set 
is then stored at 40. The translation rule set is then applied 
to the electronic source document at 42. If the translation is 
not complete at 44, then portions of the source document 
that were unable to be translated by the translation rule set 
are identified at 46. The translation rule set is then modified 
into a new translation rule set at 48. After the new translation 
rule set has been formulated, it is then applied again to the 
electronic source document at 42. The above steps continue 
until the source document has been completely translated 
into the target document or until the user is satisfied with the 
translation. 
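
The loop of FIG. 4 can be sketched in Python as follows. This is a minimal sketch under assumptions that are not in the patent text: rules are held as a plain dictionary from source tag to target tag, source tags follow the backslash model mentioned for FIG. 3, and the hypothetical callback `ask_user` stands in for the interactive step by returning new rules directly, where the described system would deduce them from user-supplied examples. All function names are illustrative.

```python
import re

# Assumed source tag model: non-space characters between two backslashes.
SOURCE_TAG = re.compile(r"\\[^\\\s]+\\")


def apply_rules(source, rules):
    """Step 42: translate known tags; return (draft, error_log of unknown tags)."""
    error_log = []

    def translate(match):
        tag = match.group(0)
        if tag in rules:
            return rules[tag]
        if tag not in error_log:
            error_log.append(tag)  # step 46: note the untranslatable tag
        return tag  # leave it visible in the draft

    return SOURCE_TAG.sub(translate, source), error_log


def translate_document(source, rules, ask_user, max_rounds=10):
    """Steps 42 to 48: apply, check completeness, extend the rules, repeat."""
    rules = dict(rules)
    draft, error_log = apply_rules(source, rules)
    for _ in range(max_rounds):
        if not error_log:  # step 44: translation complete
            break
        new_rules = ask_user(error_log)
        if not new_rules:  # the user is satisfied with the draft
            break
        rules.update(new_rules)  # step 48: modified rule set
        draft, error_log = apply_rules(source, rules)
    return draft, error_log, rules


# Hypothetical usage: one tag has no rule in the first draft.
source = "\\title\\Overview\\par\\Body\\note\\"
first_rules = {"\\title\\": "<h1>", "\\par\\": "<p>"}
draft, log, _ = translate_document(
    source, first_rules, lambda missing: {tag: "<aside>" for tag in missing}
)
print(draft, log)
```

The `max_rounds` bound is a design choice of this sketch; the description only says the steps repeat until the translation is complete or the user is satisfied.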

It is therefore apparent that there has been provided in 
accordance with the present invention, a method and system 
for translating a source document into the format of a target 
document that fully satisfy the aims, advantages and
objectives hereinbefore set forth. The invention has been
described with reference to several embodiments; however,
it will be appreciated that variations and modifications can 
be effected by a person of ordinary skill in the art without 
departing from the scope of the invention. 

I claim: 

1. A method for translating an electronic source document 
having a first format into an electronic target document 
having a second format, the method comprising the steps of: 

selecting portions from the source document having vari- 
ous constructs and formats; 

extracting the selected portions from the source docu- 
ment; 

transforming the format of the extracted portions into the 
second format of the electronic target document; 

deducing a translation rule set from the extracted portions 
and the transformed portions; 

applying the translation rule set to the electronic source 
document; 

producing a first draft of the electronic target document as
the translation rule set is applied to the electronic
source document; and

identifying portions from the electronic source document
which were unable to be translated into the target
document.

2. The method according to claim 1, further comprising 
modifying the translation rule set to account for the identi- 
fied untranslatable portions. 
3. The method according to claim 2, wherein the step of
modifying includes interactively entering example sets of 
various portions and segments in the second format that 
correspond to the identified untranslatable portions. 

4. The method according to claim 3, wherein the step of 
modifying comprises generating a new rule set containing 
rules for translating the identified untranslatable portions. 

5. The method according to claim 4, further comprising
applying the new rule set to the electronic source document.

6. The method according to claim 5, further comprising
producing a second draft of the electronic target document
as the new translation rule set is applied to the electronic
source document.

7. The method according to claim 6, further comprising
repeating the steps of identifying, modifying, generating,
applying, and producing, until the electronic source document
has been translated into the format of the electronic
target document.

8. The method according to claim 1, wherein the step of
transforming is performed by using an editor.

9. The method according to claim 1, wherein the step of 
identifying comprises providing an error log identifying the 
portions of the electronic source document that were 
untranslatable. 

10. A method for translating an electronic source document
having a first format into an electronic target document
having a second format, the method comprising the steps of:

selecting portions from the source document having various
constructs and formats;

extracting the selected portions from the source document;

transforming the format of the extracted portions into the 
second format of the electronic target document; 

deducing a translation rule set from the extracted portions
and the transformed portions;

applying the translation rule set to the electronic source 
document; 

producing a first draft of the electronic target document as 
the translation rule set is applied to the electronic 
source document; 

identifying portions from the electronic source document
which were unable to be translated into the electronic
target document;

modifying the translation rule set to account for the
identified untranslatable portions, the modified translation
rule set being a new rule set;

applying the new rule set to the electronic source document; and

repeating the steps of producing, identifying, modifying,
and applying, until the target document is in a desired
format.

11. The method according to claim 10, wherein the step of
identifying comprises providing an error log identifying the
portions of the electronic source document that were
untranslatable.

12. The method according to claim 10, wherein the step
of modifying includes interactively entering example sets of
various portions and segments in the second format that
correspond to the identified untranslatable portions.

13. A system for translating an electronic source docu- 
ment having a first format into an electronic target document 
having a second format, the system comprising: 

means for selecting portions from the source document
having various constructs and formats;

means for extracting the selected portions from the source 
document; 
means for transforming the format of the extracted portions
into the second format of the electronic target
document;

means for deducing a translation rule set from the
extracted portions and the transformed portions;

first means for applying the translation rule set to the
electronic source document;

first means for producing a first draft of the electronic
target document as the translation rule set is applied to
the electronic source document; and

first means for identifying portions from the electronic
source document which were unable to be translated
into the target document.

14. The system according to claim 13, further comprising
means for modifying the translation rule set to account for
the identified untranslatable portions.

15. The system according to claim 14, wherein the modifying
means includes means for interactively entering
example sets of various portions and segments in the second
format that correspond to the identified untranslatable
portions.

16. The system according to claim 13, wherein the trans- 
forming means is an editor. 

17. The system according to claim 14, wherein the modi- 
fying means comprises means for generating a new rule set 
containing rules for translating the identified untranslatable 
portions. 

18. The system according to claim 17, further comprising
second means for applying the new rule set to the electronic
source document.

19. The system according to claim 18, further comprising 
second means for producing a second draft of the electronic 
target document as the new translation rule set is applied to 
the electronic source document. 

20. The system according to claim 19, further comprising 
second means for identifying more portions from the elec- 
tronic source document which were unable to be translated 
in the target document. 

21. The system according to claim 19, wherein the first
and second identifying means provides an error log identifying
the portions of the electronic source document that
were untranslatable.

22. A system for translating an electronic source docu- 
ment having a first format into an electronic target document 
having a second format, the system comprising: 

means for selecting portions from the source document
having various constructs and formats;

means for extracting the selected portions from the source
document;

means for transforming the format of the extracted por- 
tions into the second format of the electronic target 
document; 

means for deducing a translation rule set from the trans- 
formed portions; 

first means for applying the translation rule set to the 
electronic source document; 

means for producing a first draft of the electronic target 
document as the translation rule set is applied to the 
electronic source document; 

means for identifying portions from the electronic source 
document which were unable to be translated into the 
electronic target document; 

means for modifying the translation rule set to account for 
the identified untranslatable portions, the modified 
translation rule set being a new rule set; 

second means for applying the new rule set to the elec- 
tronic source document; and 

means for repeating the steps of producing, identifying,
modifying, and applying, until the target document is in
a desired format.

23. The system according to claim 22, wherein the iden- 
tifying means provides an error log identifying the portions 
of the electronic source document that were untranslatable. 

24. The system according to claim 22, wherein the modi- 
fying means includes interactively entering example sets of 
various portions and segments in the second format that 
correspond to the identified untranslatable portions. 